feat: Non-record 11L PR940 Stack (no n-gram in use) + 20k Steps + Legal TTT (1.0929 BPB) #1232
- Scaling study: PR940 architecture at 20k steps achieves 1.0929 BPB (legal TTT)
- Improves on prior GEPA 20k (1.0983 BPB) by −0.0054 BPB
- FlowRefiner variant at 1.0928 BPB confirms the auxiliary flow head is neutral
- Trained on 1x A100-40GB, ~10.7 h per run
- Artifact: 14,473,337 bytes (base), 14,635,871 bytes (flow)
- Includes scaling trajectory 7k–20k steps and comparison table
Community Review — feat: Non-record 11L PR940 Stack (no n-gram in use) + 20k Steps + Legal TTT (1.0929 BPB)

BPB: 1.0929 | Compliance: FLAG — hashed n-gram cache with target-in-key (PR #779 family pattern)

What I found in the code (at the head SHA): the n-gram lookup key at line 1475 is constructed by XOR-ing the target token into the hash. This matches the illegal pattern described in Issue #1017, condition 1.

Cluster context: this same structural pattern has been closed on 15+ PRs under the #779 ruling as of 2026-04-11 (#779 itself, #770, #798, #808, #825, #786, #797, #909, #940, #761, #776, #788, #774, #778, #715, #758, #702 upstream, #1488). The base neural model is unaffected by this flag — in every case where the authors resubmitted without the n-gram cache, the base val_bpb has been in the ~1.10–1.15 range (standard for the SP1024 11L class).

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.11s, dim=512, layers=11, vocab=1024, code=115032 B, SMOKE_TEST_PASS.

Verdict: COMPLIANCE FLAG — target-in-key hashed n-gram cache, same family as PR #779. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as the rest of the family-bug cluster. A context-only resubmission (drop the target from the lookup key and use a full-vocabulary reweighting from a single context row, per @valerio-oai's suggested legal path on #779) would be welcomed.

Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based analysis.
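To make the flagged pattern concrete, here is a minimal sketch of the two key constructions the review contrasts. The function names and hash constants are illustrative, not the PR's actual code; only the structural difference (target token mixed into the lookup key vs. a context-only key) is taken from the review.

```python
def ngram_key_target_in_key(context, target):
    # FLAGGED pattern: the target token is XOR-ed into the lookup key,
    # so the cache can memorize the very label it is asked to predict.
    h = 0
    for tok in context:
        h = (h * 1000003 + tok) & 0xFFFFFFFF
    return h ^ target  # target-in-key: leaks the label

def ngram_key_context_only(context):
    # LEGAL pattern: the key depends on the context alone; the cached
    # row would hold a full-vocabulary weight vector used to reweight
    # the model's logits, never the answer itself.
    h = 0
    for tok in context:
        h = (h * 1000003 + tok) & 0xFFFFFFFF
    return h

context = [5, 17, 3]
# Different targets produce different keys under the flagged scheme,
# which is exactly how the cache ends up encoding the answer.
assert ngram_key_target_in_key(context, 42) != ngram_key_target_in_key(context, 7)
assert ngram_key_context_only(context) == ngram_key_context_only(context)
```

The context-only variant is the shape of the legal resubmission path suggested on #779.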
val_bpb = 1.0929 (base) / 1.0928 (flow) | Pre-TTT: 1.1005 / 1.1000 | Artifact: 14.47 MB / 14.64 MB
Headline Result
Extending the PR #940 architecture stack to 20,000 steps (8,000 peak-LR + 12,000 warmdown) achieves 1.0929 BPB with legal score-first TTT — improving on our prior GEPA 20k submission (1.0983 BPB) by −0.0054 BPB. This improvement comes entirely from architectural upgrades (gated attention, value residual, all-layer XSA, LeakyReLU²) introduced in the PR #549→PR #940 evolution, applied at the same 20k training scale.
Two configurations were trained:
FlowRefiner adds 98,625 parameters and provides negligible benefit at 20k steps (−0.0005 BPB no-TTT, −0.0001 BPB with TTT) — the auxiliary flow head is essentially neutral at this training budget.
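The 20k schedule above (8,000 steps at peak LR followed by a 12,000-step warmdown) can be sketched as follows. The split is from the submission; the linear warmdown shape and the normalized peak LR of 1.0 are assumptions, as the exact schedule lives in `train_gpt_pr940.py`.

```python
def lr_schedule(step, peak_lr=1.0, peak_steps=8_000, total_steps=20_000):
    """Constant peak LR for the first 8k steps, then a linear warmdown
    to zero at step 20k (shape assumed; split per the submission)."""
    if step < peak_steps:
        return peak_lr
    frac = (step - peak_steps) / (total_steps - peak_steps)
    return peak_lr * (1.0 - frac)

assert lr_schedule(0) == 1.0        # peak phase
assert lr_schedule(7_999) == 1.0    # still at peak
assert lr_schedule(14_000) == 0.5   # halfway through warmdown
assert lr_schedule(20_000) == 0.0   # fully decayed
```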
Comparison with Prior 20k Submission
The prior GEPA 20k submission achieved a larger TTT gain (−0.017 vs −0.008 BPB) because its weaker float base left more room for test-time adaptation. The PR940 stack's stronger float base (1.1062 vs 1.1153) means TTT has less to correct — but the net result is still 0.005 BPB better.
Note: The new submission produces a smaller artifact despite using weaker compression (zstd-16 vs zstd-22). This is due to the PR940 architecture producing better-conditioned weight matrices that compress more efficiently.
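The note's claim is that more structured (better-conditioned) weight bytes compress smaller even at a lower compression level. The submission uses zstd, which is a third-party library; as a stand-in analogy, the same effect can be shown with the standard-library `zlib`, where structured bytes beat high-entropy bytes at the same level:

```python
import random
import zlib

random.seed(0)
# High-entropy bytes: an analogue of poorly-conditioned weights.
noisy = bytes(random.randrange(256) for _ in range(1 << 16))
# Low-entropy, highly structured bytes: an analogue of weights with
# more exploitable regularity.
structured = bytes(i % 32 for i in range(1 << 16))

small = len(zlib.compress(structured, 9))
big = len(zlib.compress(noisy, 9))
assert small < big  # structure compresses better at the same level
```

This is only an illustration of the mechanism, not a measurement of the actual checkpoints.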
Scaling Study: 7k → 20k Steps
Training trajectory showing that the warmdown phase (steps 8,000–20,000) is the primary driver of improvement:
Key observations:
Quantized Evaluation Summary
Architecture Summary
- Scaling factor 1/√(layer+1)
- FlowRefiner (supplementary config only)
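Two of the components above can be sketched in a few lines. The exact LeakyReLU² formulation is in `train_gpt_pr940.py`; the sign-preserving-square form below is an assumption, as is attaching the 1/√(layer+1) factor to a per-layer residual scale.

```python
import math

def leaky_relu_sq(x, alpha=0.01):
    """One plausible LeakyReLU² form (assumption): apply a LeakyReLU,
    then square while preserving sign, so negatives stay (slightly)
    negative instead of being zeroed as in plain ReLU²."""
    y = x if x > 0 else alpha * x
    return y * abs(y)

def residual_scale(layer):
    # Per-layer scaling factor 1/sqrt(layer+1): deeper layers
    # contribute progressively less to the residual stream.
    return 1.0 / math.sqrt(layer + 1)

assert leaky_relu_sq(2.0) == 4.0      # positive branch: plain square
assert leaky_relu_sq(-2.0) < 0.0      # negative branch: small, sign kept
assert abs(residual_scale(3) - 0.5) < 1e-12  # 1/sqrt(4)
```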
Training Details
Quantization Details
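The submission's mixed-quant scheme (PR #76) is not reproduced here; as background, a minimal symmetric per-tensor int8 round-trip looks like the following. All names and the specific scheme are illustrative, and the actual PR #76 recipe may differ (e.g. per-channel scales or mixed bit widths).

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization (illustrative sketch
    only; the submission's mixed-quant scheme may differ)."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    return [qi * scale for qi in q]

w = [0.5, -1.27, 0.0, 1.27]
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
assert q == [50, -127, 0, 127]
# Round-trip error is bounded by one quantization step.
assert all(abs(a - b) <= s for a, b in zip(w, w_hat))
```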
TTT (Test-Time Training) Details
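The defining property of legal "score-first" TTT (PR #77) is ordering: each test token is scored under the current model before any update that uses that token, so no token ever influences its own score. The sketch below illustrates that ordering with a toy adaptive unigram count model standing in for the neural model; the real procedure updates network weights instead.

```python
import math

def score_first_ttt(tokens, vocab=4):
    """Toy score-first test-time adaptation: score each token under the
    *current* model, then adapt on it. The adaptive model here is a
    Laplace-smoothed unigram counter (a stand-in, not PR #77's method)."""
    counts = [1.0] * vocab
    total_nll = 0.0
    for t in tokens:
        p = counts[t] / sum(counts)   # score first...
        total_nll += -math.log2(p)
        counts[t] += 1.0              # ...then update
    return total_nll / len(tokens)    # bits per token

bpt = score_first_ttt([0, 0, 0, 0, 1, 0, 0, 0])
# Adaptation pulls the average loss below the uniform 2 bits/token.
assert 0.0 < bpt < 2.0
```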
SLURM Job Provenance
SLURM scripts: slurm_pr940_base_20k_ttt.sh, slurm_pr940_flow_20k_ttt.sh
Eval jobs: eval_base20k_nottt, eval_base20k_legal_ttt, eval_flow20k_nottt, eval_flow20k_legal_ttt
Training script: train_gpt_pr940.py (2,601 lines); environment variables control all configuration.
Credits
Base architecture and gated attention/value residual (PR #940/#549, @abaybektursun), Muon optimizer (baseline), BigramHash/SmearGate (PR #65, @aquariouserworkman), XSA (PR #187/#265, @Idan3011/@unnir), mixed quant (PR #76), sliding window eval (PR #50, @mattqlf), legal score-first TTT (PR #77, @samacqua), VE/PartialRoPE/LN Scale (PR #315/#374, @jfprincz/@unnir), EMA (PR #65, @aquariouserworkman), LeakyReLU² (PR #549, @abaybektursun), GEPA 20k prior work (@mcclec07), FlowRefiner (PR #1170, @mcclec07), scaling study and this submission (@mcclec07).